Abstract: This paper describes a computational comparison of value iteration
algorithms for discounted Markov decision processes.


1.  INTRODUCTION


This note describes the results of a computational comparison of value
iteration algorithms suggested for solving finite state discounted Markov
decision processes. Such a process visits a set of states
$S = \{1, 2, \ldots, M\}$. When it is in state $i$, one can choose an action
$k$ from the finite action set $K_i$, and then receive an immediate reward
$r_i^k$, and with probability $p_{ij}^k$ the process will be in state $j$ at
the next period. The object is to find $v(i)$, the maximum discounted reward
over an infinite horizon starting in state $i$, where $\beta$ is the discount
factor. It is well known [1] that $v(i)$ satisfies the optimality equation

$$v(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{M} p_{ij}^k \, v(j) \Big\} \qquad (1.1)$$

We record the time for value iteration algorithms to obtain
$\varepsilon$-optimal solutions, $\bar{v}$, to (1.1) (i.e.
$\|\bar{v} - v\| < \varepsilon$, where $\|v\| = \max_i |v(i)|$) on randomly
generated problems. We look at three classes of fifteen problems each, with
$\beta = .9$ and $\varepsilon = .0001$, where $v(i) \approx 2{,}000$. Class 1
problems have 100 states and between 2 and 7 actions per state; class 2 have
40 states and between 2 and 70 actions per state, whereas class 3 have
10 states and up to 500 actions per state. Details of how the problems
are generated and computing facilities used are given in [12].

In Section two we describe the schemes examined and the various bounds
that can be used for stopping them. Section three concentrates on one
scheme that did well in the comparison, ordinary value iteration, and
looks at various methods for eliminating non-optimal actions both
permanently and temporarily.


2.  SCHEMES  AND  BOUNDS 


The scheme usually described as value iteration is

$$v_{n+1}(i) = \max_{k \in K_i} v_{n+1}^k(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{M} p_{ij}^k \, v_n(j) \Big\} \qquad (2.1)$$

which was discussed in [1,3]. In analogy with the notation of linear
equations we call this Pre-Jacobi (PJ). This analogy leads us to think of
the following alternative schemes.


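Jacobi (J):

$$v_{n+1}(i) = \max_{k \in K_i} \Big\{ \Big( r_i^k + \beta \sum_{j \ne i} p_{ij}^k \, v_n(j) \Big) \Big/ \big(1 - \beta p_{ii}^k\big) \Big\} \qquad (2.2)$$

Pre-Gauss-Seidel (PGS):

$$v_{n+1}(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{i-1} p_{ij}^k \, v_{n+1}(j) + \beta \sum_{j=i}^{M} p_{ij}^k \, v_n(j) \Big\} \qquad (2.3)$$

Gauss-Seidel (GS):

$$v_{n+1}(i) = (A_{GS} v_n)(i) = \max_{k \in K_i} \Big\{ \Big( r_i^k + \beta \sum_{j=1}^{i-1} p_{ij}^k \, v_{n+1}(j) + \beta \sum_{j=i+1}^{M} p_{ij}^k \, v_n(j) \Big) \Big/ \big(1 - \beta p_{ii}^k\big) \Big\} \qquad (2.4)$$

Successive Over-Relaxation (SOR):

$$v_{n+1}(i) = \omega \, (A_{GS} v_n)(i) + (1 - \omega) \, v_n(i) \qquad (2.5)$$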


(PGS) was suggested by Kushner [4], Porteus [8] and Reetz [10]; (J) and
(GS) are found in [9] and SOR in [5]. Experiments with SOR suggested a
value of $\omega = 1.28$ for robust and speedy convergence.
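
To make the difference between the schemes concrete, here is a minimal Python sketch of one sweep of (2.1), (2.4) and (2.5). It is not the code used for the timings in this note. The data layout (`r[i][k]` for the reward and `P[i][k][j]` for the transition probability, with states indexed from 0) and all function names are assumptions made for illustration.

```python
def q_value(r, P, beta, v, i, k):
    """r_i^k + beta * sum_j p_ij^k v(j) for state i and action k."""
    return r[i][k] + beta * sum(p * vj for p, vj in zip(P[i][k], v))


def pj_sweep(r, P, beta, v):
    """One Pre-Jacobi step (2.1): every state is updated from the old v."""
    return [max(q_value(r, P, beta, v, i, k) for k in range(len(r[i])))
            for i in range(len(v))]


def gs_sweep(r, P, beta, v):
    """One Gauss-Seidel sweep (2.4): states j < i already hold new values."""
    v = list(v)  # work on a copy so the caller's vector is untouched
    for i in range(len(v)):
        best = float("-inf")
        for k in range(len(r[i])):
            p_ii = P[i][k][i]
            off_diag = sum(p * vj
                           for j, (p, vj) in enumerate(zip(P[i][k], v))
                           if j != i)
            best = max(best, (r[i][k] + beta * off_diag) / (1.0 - beta * p_ii))
        v[i] = best
    return v


def sor_sweep(r, P, beta, v, omega=1.28):
    """SOR step (2.5): relax between the Gauss-Seidel result and the old v."""
    gs = gs_sweep(r, P, beta, v)
    return [omega * g + (1.0 - omega) * old for g, old in zip(gs, v)]
```

With `omega=1.0` the SOR step reduces to the Gauss-Seidel sweep, which gives a simple consistency check for any implementation along these lines.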
We require bounds on the iterates of the scheme to ensure we stop when
$v_n$ is within a specified value of the optimal $v$ of (1.1). One can use
the $L_\infty$ norm bound, which says that if $\|Q^k\|_\infty \le \alpha < 1$
for all possible transition matrices $Q^k$ in the scheme

$$v_{n+1} = \max_k \{ s^k + Q^k v_n \} \qquad (2.6)$$

then $\|v - v_{n+1}\|_\infty \le \alpha \, \|v_{n+1} - v_n\|_\infty / (1 - \alpha)$.
For (PJ), (J), (PGS) and (GS), it is trivial to show the corresponding $Q$'s
have norm no greater than $\beta$. For S.O.R. we estimate $\alpha$ by
$\|v_{n+1} - v_n\|_\infty / \|v_n - v_{n-1}\|_\infty$ and substitute it in the
bound following (2.6) to get a heuristic bound.
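
As a small illustration of the $L_\infty$ stopping rule, the sketch below computes the bound $\alpha \|v_{n+1} - v_n\|_\infty / (1 - \alpha)$ and the ratio used as the heuristic estimate of $\alpha$ for S.O.R. The function names are hypothetical and the value vectors are assumed to be plain Python lists.

```python
def linf_distance(x, y):
    """Maximum absolute componentwise difference between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))


def linf_error_bound(v_new, v_old, alpha):
    """Upper bound on ||v - v_new|| given a contraction modulus alpha < 1."""
    return alpha * linf_distance(v_new, v_old) / (1.0 - alpha)


def estimated_alpha(v_new, v_old, v_older):
    """Heuristic modulus for S.O.R.: ratio of two successive changes."""
    return linf_distance(v_new, v_old) / linf_distance(v_old, v_older)
```

For a one-state process with reward 1 and $\beta = .9$, the iterates from $v_0 = 0$ are 0, 1, 1.9 and the optimal value is 10, so the bound evaluated at $v_2$ is 8.1, which is exactly the remaining error in this case.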

Porteus [7] described tighter bounds for these schemes, exploiting the
non-negativity of the elements $q_{ij}^k$ of $Q^k$ in (2.6). They require
calculation of $\alpha_i^k = \sum_{j=1}^{M} q_{ij}^k$ for the maximizing
action $k$ at each iterate, and we call these the P.C. bounds (Porteus with
calculation). In [12] we describe how to estimate the $\alpha_i^k$ initially,
which avoids the calculation at each step, but gives looser bounds, which we
denote P.N.C. (Porteus no calculation). For the (PJ) scheme we also use the
second order bounds (S.O.) described in [11], which use the last three
iteration values to get a tighter lower bound than Porteus's bound.

The results are given in the following table, where AV. is the average
C.P.U. time for solving the fifteen problems, S.D. the standard deviation
of the C.P.U. time, and N. the number of problems that the method was
quickest at solving.


TABLE 1

                    CLASS 1 (100 STATE)      CLASS 2 (40 STATE)
METHOD  BOUNDS      AV.     S.D.    N.       AV.     S.D.    N.

PJ      PC-PNC      1.66    .11     15       .79     ?       ?
PJ      SO          1.69    .11     0        .80     ?       ?
J       L∞          11.88   .59     0        14.38   2.27    0
J       PNC         10.49   .63     0        11.55   2.05    0
PGS     L∞          6.86    .33     0        8.62    1.34    0
PGS     PC          6.59    .33     0        8.34    1.32    0
PGS     PNC         6.60    .33     0        8.40    1.33    0
GS      L∞          6.55    .32     0        8.13    1.41    0
GS      PC          6.32    .34     0        7.77    1.38    0
GS      PNC         6.25    .32     0        7.70    1.41    0
SOR     L∞          3.30    .22     0        4.00    .55     0

(? = entry illegible in the source. Two further pairs of values, 7.66 1.14
and 6.90 1.09, appear in the source without an identifiable row or column.)

For Jacobi, the P.C. bound is the same as the P.N.C. bound, and so the
latter must be faster as it involves less calculation. It is obvious from
Table 1 that PJ with Porteus bounds performs very well, and in the next
section we concentrate on this scheme and apply elimination of non-optimal
actions.

3.  ACTION  ELIMINATION 

MacQueen [6] described how, for any bounds, one can derive a test to
identify actions that cannot optimize the right hand side of (1.1) and so
can be permanently eliminated from the calculation. Applying MacQueen's
bounds [6] and Porteus's bound [7] for the PJ algorithm leads to the
following tests to eliminate action $k$ in $K_i$ permanently.

MacQueen:

$$v_n^k(i) < v_n(i) + \beta (a_n - b_n) / (1 - \beta) \qquad (3.1)$$

Porteus:

$$v_n^k(i) < v_n^d(i) + \beta^2 (a_{n-1} - b_{n-1}) / (1 - \beta) \qquad (3.2)$$

where $a_n = \min_i \big( v_n(i) - v_{n-1}(i) \big)$ and
$b_n = \max_i \big( v_n(i) - v_{n-1}(i) \big)$.

We looked at four ways of implementing these tests.

M1. At the $n$th iteration, calculate and store $v_n(i)$ for each $i$. Then
calculate $a_n$ and $b_n$. Recalculate $v_n^k(i)$ and use (3.1) to test
for elimination.

M2. At the $n$th stage, calculate and store all $v_n^k(i)$. Hence calculate
$v_n(i)$, $a_n$, $b_n$ and test for elimination using (3.1) without
recalculating $v_n^k(i)$.

P1. At the $(n+1)$th stage, calculate $v_{n+1}^k(i)$, starting with the
action $k$ that maximized $v_n^k(i)$ at the previous stage. Apply (3.2) as
soon as you calculate each $v_{n+1}^k(i)$, using as $d$ the action that gives
the maximum $v_{n+1}^k(i)$ so far calculated; see [7].

P2. At the $(n+1)$th stage, calculate and store $v_{n+1}^k(i)$. Then, using
$v_{n+1}(i) = v_{n+1}^d(i)$, apply (3.2).

As Table 2 shows, M2 is far superior to M1, but P1 and P2 give similar
results. M2, P1 and P2 all cut the average time by about a half, though.
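
The M2 organisation can be sketched in Python as follows. This is an illustrative reading of the description above, not the program used for Table 2: the data layout (`r[i][k]`, `P[i][k][j]`), the zero starting vector, the function name, and the use of the $L_\infty$ bound with $\alpha = \beta$ as the stopping rule are all assumptions.

```python
def pj_with_m2_elimination(r, P, beta, eps, max_iter=10_000):
    """PJ value iteration with MacQueen's test (3.1), organised as in M2."""
    M = len(r)
    active = [set(range(len(r[i]))) for i in range(M)]  # surviving actions
    v_old = [0.0] * M
    for _ in range(max_iter):
        # Calculate and store every v_n^k(i) for the surviving actions.
        q = []
        for i in range(M):
            q.append({k: r[i][k] + beta * sum(p * x
                                              for p, x in zip(P[i][k], v_old))
                      for k in active[i]})
        v_new = [max(q_i.values()) for q_i in q]
        diffs = [new - old for new, old in zip(v_new, v_old)]
        a_n, b_n = min(diffs), max(diffs)
        # Test (3.1) on the stored values, with no recalculation.
        threshold = beta * (a_n - b_n) / (1.0 - beta)  # never positive
        for i in range(M):
            active[i] = {k for k in active[i]
                         if q[i][k] >= v_new[i] + threshold}
        v_old = v_new
        # Assumed stopping rule: L-infinity bound with alpha = beta.
        if beta * max(abs(d) for d in diffs) / (1.0 - beta) < eps:
            break
    return v_new, active
```

Because the threshold is never positive, the maximizing action in each state always survives the test, so the iterates are the same as those of plain PJ; only the amount of work per iteration changes.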

Hastings and Van Nunen [2] pointed out that one could also eliminate
actions temporarily, i.e. actions that will not be the optimizing actions
at the next iteration of the PJ algorithm. This is based on the inequality

$$v_{n+s}(i) - v_{n+s}^k(i) \ge v_n(i) - v_n^k(i) - \beta \sum_{j=1}^{s} \big( b_{n+j-1} - a_{n+j-1} \big) \qquad (3.3)$$
If the R.H.S. of (3.3) is positive, $k$ will not optimize the $(n+s)$th
iteration, and in that case, at the $(n+s+1)$th iteration we need only
subtract another $\beta (b_{n+s} - a_{n+s})$ from this positive number to
test if $k$ could be optimal. If action $k$ is not eliminated at the
$(n+s)$th iteration, $v_{n+s}(i) - v_{n+s}^k(i)$ is stored for the test at
the next iteration.
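
The bookkeeping behind (3.3) can be sketched in Python as below. This is a minimal illustration of temporary elimination on its own, under assumptions that are not taken from the source: the data layout (`r[i][k]`, `P[i][k][j]`), the zero starting vector, the function name, the returned counters, and the $L_\infty$ stopping rule with $\alpha = \beta$.

```python
def pj_with_temporary_elimination(r, P, beta, eps, max_iter=10_000):
    """PJ value iteration that skips actions whose stored gap stays positive."""
    M = len(r)
    v_old = [0.0] * M
    # gap[i][k]: lower bound on v_n(i) - v_n^k(i); None forces an evaluation.
    gap = [[None] * len(r[i]) for i in range(M)]
    spread = 0.0       # beta * (b_n - a_n) from the previous iteration
    evaluations = 0    # number of v^k(i) values actually computed
    iterations = 0
    for _ in range(max_iter):
        iterations += 1
        v_new = []
        for i in range(M):
            q = {}
            for k in range(len(r[i])):
                if gap[i][k] is not None and gap[i][k] - spread > 0.0:
                    gap[i][k] -= spread  # k cannot maximize now: skip it
                    continue
                q[k] = r[i][k] + beta * sum(p * x
                                            for p, x in zip(P[i][k], v_old))
                evaluations += 1
            best = max(q.values())
            for k, value in q.items():
                gap[i][k] = best - value  # exact gap for evaluated actions
            v_new.append(best)
        diffs = [new - old for new, old in zip(v_new, v_old)]
        spread = beta * (max(diffs) - min(diffs))
        v_old = v_new
        # Assumed stopping rule: L-infinity bound with alpha = beta.
        if beta * max(abs(d) for d in diffs) / (1.0 - beta) < eps:
            break
    return v_new, evaluations, iterations
```

The action that maximized the previous iteration has a stored gap of zero, so it is always re-evaluated and the maximum is never taken over an empty set.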

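We looked at four ways of implementing these two elimination procedures.
Recall that the $(n+1)$th iteration consists of the following sequence of
calculations, with (I) to (IV) marking the points between them.

$$a_n, b_n \;\to\; \text{(I)} \;\to\; v_{n+1}^k(i) \;\to\; \text{(II)} \;\to\; v_{n+1}(i) \;\to\; \text{(III)} \;\to\; a_{n+1}, b_{n+1} \;\to\; \text{(IV)} \;\to\; v_{n+2}^k(i)$$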

TEMP HVN. Hastings and Van Nunen [2] suggested the temporary elimination
test be made at (I), and if $k$ was not temporarily eliminated then
$v_{n+1}^k(i)$ was calculated. The permanent elimination test was made at
(II) using (3.2) with $v_{n+1}^d(i)$ replaced by a lower bound
$v_n(i) + \beta a_n$. If the action is not permanently eliminated,
$v_n(i) + \beta a_n - v_{n+1}^k(i)$ (rather than
$v_{n+1}(i) - v_{n+1}^k(i)$) is stored.

TEMP + P1. Temporary elimination occurs at (I) and permanent elimination
at (II) using P1.

TEMP + P2. Again this has temporary elimination at (I) and permanent
elimination at (III) using P2.

TEMP + M2. Temporary elimination occurs at (I), and in this case
$v_{n+1}^k(i)$ was stored until (IV) and then the M2 technique used. If
action $k$ was not eliminated, $v_{n+1}(i) - v_{n+1}^k(i)$ was stored for the
temporary elimination test of the next iteration, which followed immediately.

In this case, when permanent and temporary elimination are done at the
same stage, it is obvious that any action which is permanently eliminated
would also be temporarily eliminated.


This leads us to ask whether it is worth permanently eliminating at all, so
we also tried the following.

TEMP ONLY. Perform the temporary elimination test at (I). If action $k$ is
not eliminated, $v_{n+1}^k(i)$ is stored through stage (II) and changed to
$v_{n+1}(i) - v_{n+1}^k(i)$ at stage (III) for the next temporary elimination
test at (IV).

Table  2  describes  the  results  and  shows  that  temporary  elimination 
further  cuts  the  time  by  25%,  and  that  pure  temporary  elimination  might 
be  particularly  good  on  large  scale  problems. 

TABLE 2

              CLASS 1 (100 STATE)    CLASS 2 (40 STATE)     CLASS 3 (10 STATE)
METHOD        AV.    S.D.   N.       AV.    S.D.   N.       AV.    S.D.   N.

M1            1.51   .09    0        0.42   .06    0        ?      ?      ?
M2            0.81   .05    15       0.36   ?      15       0.25   .03    15
P1            0.87   .05    0        0.43   .05    0        0.26   .03    0
P2            0.88   .05    0        0.45   .05    0        0.28   .04    0
TEMP HVN      0.62   .03    0        0.27   .03    0        0.21   .03    0
TEMP + P1     0.60   .04    ?        0.25   ?      3        0.20   .03    ?
TEMP + P2     0.58   .03    0        0.26   .02    0        0.23   .03    0
TEMP + M2     0.59   .04    0        0.22   .04    2        0.21   .03    0
TEMP ONLY     0.55   .03    15       0.22   .04    10       0.21   .03    0

(? = entry illegible or missing in the source.)

Our object has not been to obtain a best buy, but to give some idea of
the merits of the various schemes, bounds and improvements. Obviously, for
more structured problems, algorithms which exploit the structure will be
at an advantage.